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Abstract 

Preconditioning is at the core of modern many-fermion Monte Carlo algorithms, such as Hybrid Monte Carlo, 
where the repeated solution of a linear problem involving an ill-conditioned matrix is needed. We report on a 
performance comparison of three preconditioning strategies, namely Chebyshev polynomials, strong-coupling 
approximation and weak-coupling expansion. We use conjugate gradient (CG) on the normal equations as 
well as stabilized biconjugate gradient (BiCGStab) as solvers and focus on the fermion matrix of the unitary 
Fermi gas. Our results indicate that BiCGStab is by far the most efficient strategy, both in terms of the 
number of iterations and matrix-vector operations. 



1. Introduction 

The Hybrid Monte Carlo (HMC) 0, i] algorithm provides an efficient way of calculating the properties 
of many-body Fermi systems. HMC achieves high efficiency by performing global updates by means of 
Molecular Dynamics (MD) trajectories that are integrated with a finite time step. At each step, the solution 
of a linear problem 

M^Mx = b (1) 

is needed. In this problem, M is a large but typically sparse matrix which provides a real-space representation 
of the fermion operator that defines the dynamics of the system. While M is typically neither symmetric 
nor positive definite, it is always possible to consider solving Eq. (fTJ via conjugate gradient (CG) iteration, 
as M'M satisfies these properties [3|. However, as CG is a Krylov-type method based of successive matrix- 
vector (MV) operations using M'M, the efficiency of CG depends critically on the condition number of that 
matrix. Unfortunately, in most cases of interest M^M may be extremely ill-conditioned, leading to a rapid 
growth of the number of CG iterations required to reach a preset convergence criterion. Such a situation 
manifests itself in many problems at low enough temperatures or strong enough couplings. Apart from 
the obvious disadvantages of a rapidly growing computational cost, any iterative method may eventually 
become unstable if the number of iterations grows without bounds. For these reasons, CG and similar 
Krylov methods are preferentially used together with a suitable preconditioner. It should be pointed out 
that while direct methods such as LU decomposition avoid the problems related to ill-conditioning, problems 
of physical interest are typically large enough such that the storage of M is not feasible. It is thus desirable 
to resort to "matrix-free" methods based on MV operations only, which require significantly less memory 
and scale much more favorably with system size. 

Whenever M is ill-conditioned, the problem is much more severe for M^M . An attractive option is then 
to perform the solve the problem in two steps, by considering the linear systems given by 

M f y = b, Mx = y. (2) 
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However, since M is typically neither symmetric nor positive definite, these equations require more sophisti- 
cated algorithms such as biconjugate gradient (BiCG) Q or stabilized biconjugate gradient (BiCGStab) Q, 
which is a modern development of the former. One is thus dealing with a much less ill-conditioned problem, 
at the price of solving two successive problems. Nevertheless, preconditioning remains instrumental in ac- 
celerating and stabilizing the solution process. In its simplest form, preconditioning amounts to considering 
the linear problem 

PM ] Mx = Pb, (3) 

where P represents an approximation to (M'M) -1 . A similar procedure applies to Eq. @ given approx- 
imations to the inverses of M and ML The procedure in Eq. ([3]) is referred to as "left preconditioning". 
So-called "right preconditioning" is also possible by insertion of P between M^M and x, followed by multi- 
plication with P once the solution has been obtained. In this work, we shall be mainly concerned with the 
former approach. 

While the choice of P is in principle arbitrary, preconditioning is only useful if the solution of Eq. ((3J) to 
a given accuracy requires fewer CG iterations than the solution of Eq. ((T|). Moreover, as iterative methods 
tend to accumulate roundoff error, an iterative solution may not even be possible without preconditioning. 
In such cases preconditioning also serves to stabilize the algorithm by enabling convergence. Two obvious 
extreme choices are those of setting P = I and setting P — (M'M) -1 , however no benefits are achieved in 
either case, as the first one amounts to no preconditioning at all and the second one requires the solution 
of the original problem in order to determine P. Therefore, what one seeks is compromise between the 
computational simplicity of the first option and the effectiveness of the second one. 

Over the years, many preconditioning strategies were developed and tested on a wide range of problems 
(see e.g. Chapter 11 in Ref. [6[, or Chapters 9 and 10 in Ref. [7|). In general, preconditioners specifically 
designed for a particular problem tend to be superior to generic approaches that build in no knowledge 
about the physics of the system in question. With this in mind, we will explore several approaches to 
preconditioning which combine some knowledge of the features of M, and M^M, in order to find a 
strategy which is both computationally efficient and effective in accelerating the convergence of the linear 
problem at hand. 

In this work, we are mainly concerned with problems that arise in the context of HMC simulations 
of non-relativistic, (3+l)-dimensional many-fermion systems. The HMC approach forms a central part of 
state-of-the-art Lattice QCD calculations [8(, which involve a relativistic system of fermions. The popularity 
of HMC stems from the great algorithmic efforts of the Lattice QCD community to enable simulations on 
large lattices. On the other han<L HMC remains little known in other areas such as condensed-matter 
physics [9[ and nuclear structure [lOj, where many of the problems are non-relativistic and determinantal 
Monte Carlo (DMC) [ll| remains the method of choice. The most serious drawback of DMC is the poor 
scaling of the computation time with the lattice volume V, namely ~ V 3 . In contrast, the use of global 
updates and iterative solvers enables the HMC algorithm to scale as ~ V 5 ^ 4 . In spite of this obvious 
advantage, the widespread use of HMC implementations for non-relativistic fermions has been hampered by 
problems related to the operator M^M, which is severely ill-conditioned at low temperatures (see e.g. [12]). 
It is therefore of great interest to explore various preconditioning strategies, as these hold the key to finding 
a practical HMC implementation which is competitive with current DMC calculations. 

The class of problems at hand is characterized by a zero-range interaction, such that the Hamiltonian is 
given by 

H = K + V, (4) 

where the kinetic energy is 




and the potential energy 

V = -9 [ dr (rty. (r)$ (r) V> t (r) , (6) 



2 



such that tpt( r ) an d V'sW are creation and annihilation operators for particles of spin s at point r, respec- 
tively. We next seek to recast the grand canonical partition function 



Z = Tr exp(-P(H - fiN)), 



(7) 



in a form amenable to a Monte Carlo calculation. We proceed by discretizing imaginary time into N T 
slices of length r = j3/N r using a Trotter- Suzuki decomposition 13] followed by a continuous Hubbard- 
Stratonovich transformation (HST) of the form recently proposed by Lee . It should be noted that 
these last two steps are, to a certain extent, a matter of choice. For instance, continuous-time formulations 
exist [l6[ although they have as yet not been combined with HMC. The choice of HST is also dictated by 
computational preferences [l7| . although it is not clear at this time whether an HMC implementation is 
compatible with a discrete HST. The end result is a path integral representation of the partition function 



T>a (det M[a}) 2 = Va det M^[a]M[a], 



(8) 
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where the explicit form of the block matrices 
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(9) 



B 3 [a] = exp(-rK) (l+Asin^-]) 



(10) 



and A = V2y exp(rg) — 1, which results from the specific choice of HST. In momentum space, the kinetic 
energy is given by 

= 5 kM , E(k), (11) 

where we have chosen the dispersion relation E(k) — ft 2 k 2 /2m for our calculations. The integer vec- 
tors (k, k') assume values on a three-dimensional momentum lattice given by k i = 2im i /N x , where n i — 
—N x /2, . . . ,N x /2 — 1. Eq. ([9]) provides a representation of the fermionic operator M referred to earlier, 
which is real and manifestly not symmetric. 

One of the most popular and successful implementations of HMC is known as the (^-algorithm 0] , which 
combines the stochastic evaluation of the fermion determinant in Eq. ([5]) with the MD evolution of the 
auxiliary Hubbard-Stratonovich field a (originally the gauge field in the case of Lattice QCD). To this end, 
the pseudofermion representation 



det M^[a]M[a] = J V$V<j> exp(-5 p [tr]), 



is applied, giving 



S p [a] = 



E 



(12) 



(13) 



for the corresponding action, where Q = M'M. The next step is to introduce an auxiliary field 7r that 
will play the role of a conjugate momentum to the field a in the molecular dynamics evolution. This is 
accomplished by multiplying the partition function by a constant in the form of a path integral over ir, 



Z = J VaVnV^Vfi exp(~H[a, tt]), 



(14) 
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where the MD Hamiltonian is given by 

2 

n,r 

In the pseudofermion formulation, field configurations are sampled by evolving a and 7r according to the 
classical MD equations of motion that follow from the MD Hamiltonian. It should be noted that the 
pseudofermion field (f> may be sampled exactly from a Gaussian heat bath and is kept constant during the 
MD evolution. In practice, the MD evolution should be performed using a reversible symplectic integrator 
in order to ensure detailed balance. At each step in the MD evolution the "fermion force" 

F n , T M] = f CrV] f Q Q- l [(x}4>, (16) 
oer(n, t) 

is calculated, which requires frequent computation of the vector 77 = Q -1 [er]</>. This represents the most 
time-consuming part of the HMC algorithm. One of the most obvious advantages of the ^-algorithm is that 
the direct calculation of a large determinant is avoided, in favor of the solution of a much smaller linear 
problem a fixed number of times. Specifically, in our case this involves the repeated solution of Eq. ((T|) in its 
preconditioned form given by Eq. ([3]) , where the structure of the matrix is constant but the auxiliary field 
a varies at each MD step. This should be contrasted with the cost of computing the full inverse, which is 
much more expensive than solving a single linear problem. By avoiding the calculation of the determinants 
and inverses, this algorithm enables global updates of the auxiliary field a via MD evolution. Moreover, all 
HMC approaches can be rendered free of systematic errors associated with a finite MD integration step size 
by means of a Metropolis accept/reject step at the end of each MD trajectory. 

In Sec. [21 we study the effect of preconditioning via Chebyshev polynomials, which provide an approxima- 
tion to Chebyshev polynomials provide a generic approach which is easy to apply, but computationally 
expensive. In Sec. [31 we proceed to study the weak-coupling approximation (WCA) method of precondition- 
ing M and its transpose separately. In Sec. 0] we explore the strong-coupling approximation (SCA) to Q^ 1 
which was originally proposed in Ref. (l8| . Our findings are summarized in Sec. [SJ followed by a concluding 
discussion in Sec. [51 



2. Chebyshev polynomials 

Chebyshev polynomials have frequently been applied in the preconditioning of rclativistic quantum field 
theory problems. In fact, they have led to the development of a whole new class of HMC-type algorithms 
commonly referred to as Polynomial Hybrid Monte Carlo (PHMC) 20], in which preconditioners are used to 
separate modes that evolve at different rates in an MD trajectory. The Chebyshev preconditioning technique 
is based on approximating z~ x with a Chebyshev polynomial P(z) of degree 2n, 



P(z) 



-2n 



2 1 1 

fc=i 



2n,k)i 



(17) 



where the coefficients are given by 



-2n 



n 

k=l 



-2n,k i 



(18) 



and the roots by 



1 + e 



-2n,k 



1 — COS 
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(19) 
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such that the parameter < e < 1 provides a lower bound for the range of validity of the approximation. 
Indeed, in the interval [e, 1] the Chebyshev polynomials provide a good approximation, in the sense that the 
relative error 

/l — /~\ 2,1+1 

\R 2n J<2\^-^) , (20) 

where 

RmM = [PznM - l l z ] z > ( 21 ) 

is exponentially suppressed with the degree of the polynomial. 

The preconditioner we seek is obtained from Eq. (TTT)) upon replacing z — > M'M, assuming that the 
eigenvalue spectrum of M^M is normalized in the range [e, 1]. One of the most appealing features of the 
Chebyshev preconditioning method is that its speed and effectiveness can be tuned by varying the degree 
of the polynomial n. Moreover, since P is applied in its factorized form, the speed of the preconditioner 
is directly linked to the speed with which M is applied. The order in which the factors of Eq. (|17[) arc 
applied turns out to be critical for numerical stability, due to cancellations between the various terms that 
occur naturally when evaluating a polynomial in finite precision arithmetic. In practice it becomes crucial 
beyond n ~ 20 to address this issue. Various orderings of the factors were explored in Ref. [19| , as well as a 
recursive Clenshaw algorithm which largely eliminates the accumulation of round-off error. Our experience 
indicates that the performance of the "bit-reversal" ordering of Ref. [19j is nearly identical to that of the 
Clenshaw algorithm, and it is slightly faster and requires less memory, in addition to being much simpler to 
implement. The bit-reversal ordering will therefore be our method of choice. 

A serious disadvantage of the Chebyshev approach is that it is only applicable to the preconditioning of 
the normal equations, as it requires the matrix to have positive definite eigenvalue spectrum. Also, as M*M 
becomes extremely ill-conditioned at low temperatures, polynomial orders larger than n ~ 64 are needed for 
effective preconditioning, which makes the Chebyshev approach rather computationally intensive. As will 
become evident in Sec. [3 this problem can largely be avoided by separately inverting M and M' , which 
however requires a different type of preconditioner. We shall now turn to the description of such strategies. 



3. Weak coupling expansion 

In this section, we demonstrate how the weak-coupling limit of M can be used as the starting point of 
an efficient preconditioning strategy for M and M'. It is useful to note that the representation of M given 
in Eq. ([SJ can be decomposed as 



M = M -I- AM X 

where M = lim^-^o M corresponds to the non-interacting case, and 



(22) 



M x = 











-c, 



-c 



N-2 



-c 



c 








(23) 



where the block matrices are given by 



C = cxp(— tK) sin[cr 



(24) 



The inverse M 1 can then be formally expanded in powers of AX, giving 



AT 1 = (1 + AX^M- 1 = (1 - AX + A 2 X 2 -... + ...) M~ 



(25) 
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where X = M ~ 1 M 1 . One can then determine a posteriori whether the truncated series represents a good 
approximation to M , which is a reasonable expectation for A <C 1. The most computationally intensive 
part of such an approximation is the application of Mq 1 . However, this matrix is diagonal in frequency- 
momentum space. It is thus convenient to apply Mq 1 after a four-dimensional Fourier transform of the 
relevant vector, followed by an anti-Fourier transform to return to the original basis. This approach is 
referred to as Fourier acceleration. In frequency- momentum space, Mq 1 is given by 

[Mo]Zw t u ~ <L,*A,k' [1 ~ exp (-iw - Th 2 k 2 /2m)] _1 , (26) 

up to a constant normalization factor, with 

u = (2n T + l)ir/N r , k t =2n l7 r/N x , n„ = -JV„/2, . . . , - 1. (27) 

Care must be taken that the boundary conditions in the temporal direction are antiperiodic, as befit fermions, 
rather than periodic as commonly assumed in FFT routines. A possible complication arises due to the fact 
that the radius of convergence of Eq. (|25[) can be quite limited depending on the spectrum of X, which is 
unknown for all practical purposes. We therefore turn to techniques which may accelerate the convergence 
of the expansion. 



3.1. Convergence acceleration methods 

Here, we will briefly review the practical aspects of the Euler and van Wijngaarden methods for con- 
vergence acceleration of series expansions, without concerning ourselves with the underlying theory. The 
interested reader is referred to the standard literature, see e.g. Refs. [U, [H, [13] . Given a sequence of partial 
sums (in this case of sign-alternating terms) 



S 0tk = J2(-l) n a ri 



(28) 



where k = 0, . . . , fc max , the Euler method consists of defining a set of new sequences given by 



= pSj,k + (1 - p)Sj, 



(29) 



where k — 0, 



k 

) max 



(j' + l). According to this method, instead of using S a k as an estimate of the 
infinite sum, one can arrive at a much better estimate as follows. Use the original sequence to define new 
ones using Eq. (f2Uf until all the terms available have been exhausted. The last sum, namely S k will 
consist of only one term, which is the estimate we seek. In practice one usually takes p = 1/2, and it is not 
uncommon to achieve convergence improvements by several orders of magnitude with a handful of terms as 
a starting point. It has also been pointed out in the literature that S 2 k 



„/3,/c max /3 



is often an even better 



approximation than S k . We shall apply this method to the series that results of applying Eq. (|25[) to a 
given vector. 

The sequence Sj is commonly referred to as the Euler transform of the original sequence 5*0 k , and it 
can be shown to be given by 



where 



3=0 V 7 ' 



(30) 



(31) 



The idea of van Wijngaarden was to first multiply each term in the original series by non- vanishing arbitrary 
constants \ k and then perform an Euler transformation. It was assumed that a moment generating function 
4>(t) exists such that 



cj)(t) t k dt, 



(32) 



and it was shown that the sum of the original series is given by 



n=0 n=0 

where 

poo 

»> = J dtm F^F ' (34) 

In practice, van Wijngaarden's transformation can turn a slowly convergent or even divergent series into a 
rapidly convergent series. It can also be shown that the so-called special van Wijngaarden transformation, 
for which = a /k\ and <fi(t) — sexp(-st), where s is a free parameter, does not change the Borel sum of 



the original series and corresponds to the Laplace transform of the Euler transformation [22 1 



4. Strong coupling approximation 

The idea of constructing a preconditioner based on a strong-coupling approximation was originally devel- 
oped by Scalettar et al. in Ref. [l8[ . It consists of constructing the inverse of a strong-coupling approximation 
to M'M, which we shall denote by M T M, where M has the same structure as M but with the substitution 



B: 



Bo-i = 1 + ^4sin[cr], 



(35) 



where it should be noted that B j is diagonal in coordinate space. In this approach, all the blocks will have 
this property, which makes it extremely inexpensive from a computational point of view. In order to find 
the inverse, one defines a factorization 



where 



it 



-Li 





V l Nt 



M^M = L^DL, 




1 

-L 2 










\ 








1 / 



(36) 



(37) 
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( D l 
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D, 
















V o 











D 







for which it is straightforward to show that the individual blocks are given by 



(38) 



L t = 
D t = 



D7 B, 



OA I 



(39) 
(40) 



for t = 1 , . . . , N and 



= a 



R 2 

n 0,N r 



B, 



N T -1^0,N T -l 



B 0,N T D 1 



B, 



(41) 
(42) 
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where the parameter a is tuned such that the number of CG iterations is minimized. The matrix 

L~ 1 D~ 1 L J< ~ 1 = (M^M)' 1 (43) 

then plays the role of the preconditioner in this approach. Notice that the matrices L and are trivial to 
invert, in the sense that the linear problems Lx = y and Ux = y can be solved in a fast and straightforward 
fashion. For example, X — L'^Y is given by the recursive expression 

x(r,l) = y(r,l), 

x(r,2) = y(r,2) + L lX (r,l), 

x(r,3) = y(r,3)+L 2 x(r,2), 

x(r : N T ) = y{r 1 N T ) + L Nr _ 1 x{r,N T -l)-L NT x{r,l). (44) 

While Ref. [l8[ found this approach promising, it presents at least two disadvantages. Firstly, it is not easy 
to improve in a systematic fashion, in contrast with the Chebyshev approximation. Indeed, one of the key 
properties of this preconditioner is that it only involves diagonal matrices. Any improvement involving the 
kinetic energy operator will eliminate this property and make the approach much more computationally 
expensive. Secondly, just like the Chebyshev method, this preconditioner aims at approximating the inverse 
of M^M, whereas inverting M and M* separately is preferable due to the dramatic increase in the condition 
number of M^M relative to M. However, it should be pointed out that this preconditioner is computationally 
very inexpensive and can furthermore be combined with the Chebyshev preconditioner, using the latter as 
a secondary filter. It is not yet known whether such a combined approach is useful in practice. 

5. Results 

The number of CG or BiCGStab iterations required to solve the linear problems of Eqs. (JTJ) and © 
is given in Table [JJ for the various preconditioners, with a more comprehensive study of the scaling of the 
second-order weak-coupling preconditioner (WCE2) in Fig. [TJ The convergence criterion, referred to as the 
tolerance parameter, was taken to be a reduction in the norm of the residual by a factor of 10~ 12 . The 
achieved absolute accuracy is approximately constant for each space-time volume N% x N T , but deteriorates 
slowly as N x and N T are increased. However, it should be noted that in typical HMC simulations, a tolerance 
parameter of ~ is considered adequate. Our test runs have been performed using a imaginary time 

step of r — 0.05, with couplings of g — 2.5, 5.0 and 7.5. The corresponding values of A are ~ 0.52, 0.75 
and 0.95. The results shown in Tables [TJ and [5] correspond to g = 5.0, which is close to the unitary limit. 

We have used auxiliary field configurations a with randomly distributed entries in the interval [— 7r,7r] 
in order to characterize the performance of the preconditioners. While this is the proper interval for the 
chosen HST, the distribution differs from that of a thermalizcd HMC simulation, where the values of a tend 
to cluster around — it/ 2 and n/2. However, in practice the net effect of using thermalized instead of random 
configurations is to increase the number of CG iterations by a factor of ~ 2 across all parameter values. We 
have confirmed this behavior in a number of cases, and we therefore conclude that the relative merits of 
the preconditioners remain unaffected by the issue of thermalized versus random configurations. We have 
also found that 10 random field configurations suffices to provide an estimate of the average number of CG 
iterations with an uncertainty of less than 10— 15% in all cases. The uncertainly can be further suppressed by 
increasing the number of sample configurations. So far, we have made limited use of Euler-Van Wijngaarden 
acceleration techniques. For the second-order weak-coupling expansion (WCE2), the simplest Euler method 
yielded a moderate but definite improvement over the non-accelerated implementation. For the fourth-order 
expansion (WCE4), the Euler method was found to reduce the number of iterations by up to a factor of 2. 
Finally, we have varied the right-hand side from a constant vector to one with randomly distributed entries 
and found no impact on the results. 

The number of MV operations consumed by each method is shown in Table [3J Per iteration, CG on 
the normal equations requires one application of M*M as well as the preconditioner P(M*M). On the 
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Table 1: Summary of the number of iterations required for the solution of M'Mx = 6 with CG, or M*y = b followed by 
Mi = y with BiCGStab, for a tolerance of 10 — 12 and a coupling of g = 5.0 (A ~ 0.75), which is close to the unitary limit. The 
columns labeled "Cheb8", u Chebl6" and "C7ie632" denote, respectively, Chebyshev preconditioning of degrees 8, 16 and 32 
using bit-reversal ordering of the roots. The strong-coupling preconditioner is labeled U SCA" , and " WCEO" , " WCE1 " and 
" WCE2" denote the weak-coupling expansion of order 0, 1 and 2 in A, respectively. The reported numbers represent averaged 
over 10 random auxiliary field configurations, which yields an uncertainty below 10-15%. A line denotes the cases that failed 
to converge in less than 10,000 iterations. 



Lattice size 




CG 








BiCGStab 




N* x N T 


Cheb8 


Chebl6 


Cheb32 


SCA 


WCEO 


WCE1 


WCE2 


6 3 x 50 


294 


167 


104 


1,790 


128 


77 


55 


8 3 x 50 


386 


226 


130 


2,388 


147 


84 


62 


10 3 x 50 


489 


259 


163 


2,810 


152 


93 


67 


6 3 x 100 


534 


316 


187 


3,470 


162 


97 


71 


8 3 x 100 


875 


460 


288 


5,223 


183 


114 


78 


10 3 x 100 


1,102 


610 


382 


7,104 


196 


118 


88 


6 3 x 200 


864 


530 


276 


5,987 


208 


119 


87 


8 3 x 200 


3,301 


998 


604 




246 


148 


105 


10 3 x 200 




1,306 


735 




279 


162 


121 



other hand, BiCG and BiCGStab require the application of both M and M* (as well as their respective 
preconditioners) for each iteration. Thus, in the absence of a preconditioner, the cost per iteration of CG, 
BiCG and BiGStab is the same. The main advantage of using the normal equations with CG is that one 
avoids the need to successively solve two linear problems. The MV operations performed per CG iteration 
when using Chebyshev preconditioning can be estimated as follows: the application of a polynomial of degree 
d to a vector requires 2d MV operations; adding the application of the matrix M^M itself yields a total of 
2(d + 1) operations. Estimating the computational cost of the weak-coupling preconditioner is somewhat 
more involved. The BiCG or BiCGStab iterations contribute two MV operations from the application of 
M> and M. For each power of A, applying K contributes one MV operation to the total cost. However, the 
non-interacting matrix M is applied using a four-dimensional FFT, which scales as AT 3 x N T x log(V 3 x N T ), 
which differs from the iV 3 x N T x log(V 3 ) scaling of the MV operations. The overall computational cost 
of the approach using the weak-coupling preconditioner of degree d in A is given by 2(d + 1)(1 + 2/3) MV 
operations, where f3 is the factor in front of the scaling law for the 4D FFT, relative to that of the 3D 
FFT. While we have not attempted to estimate j3 directly (it is implementation-dependent), we find that 
in practice the gain in CPU time can be lower by a factor of ~ 2 compared with the gain factors quoted in 
Tabled where we set (3 — 1. Notice however that the gain is still substantial, especially as the total number 
of MV operations increases only mildly as a function of N T . 

While the strong-coupling approximation is extremely inexpensive from a computational point of view, 
its effect on the solution process is rather mild, at least for the values of g considered here. Thus, the number 
of CG iterations remains rather high, which is an undesirable feature as the iterative solution process is prone 
to accumulate numerical round-off error after seveal hundred iterations, which is especially problematic for 
the BiCG and BiCGStab solvers. As a rule of thumb, in our HMC calculations we aim to keep the iterations 
below the 200 — 300 range. Though computationally cheap, a preconditioner which is unable to reduce the 
iteration count below ~ 1000 is of limited practical usefulness, especially at large N T where the number 
of iterations may increase dramatically. The effective cost of the strong-coupling preconditioner in terms 
of MV operations is difficult to estimate, as it involves no FFTs. We have found, empirically, that the 
number of MV operations required per CG iteration can be conservatively estimated to be ~ 3. In light of 
the exceedingly high iteration count shown in Table [T] for this preconditioner, however, this MV estimate is 
irrelevant. 

We have also tested the weak-coupling approximation with CG and found that it provides little benefit, 
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450 



N 



Figure 1: BiCGStab iterations as a function of N T and N x , for the solution of M^y = b followed by Mx = y, for a tolerance of 
10~ 7 and a coupling of g = 5.0 (A ~ 0.75), using the second-order weak-coupling expansion (WCE2) as preconditioner. The 
results represent averages over 20 random auxiliary field configurations. 

in contrast to the dramatic improvement when used with BiCGStab. This situation arises as we have only 
attempted left preconditioning with CG, and thus the full potential of the preconditioning strategy is not 
realized. For this to be the case, one would need to precondition with (Mq)^ 1 on the left and with M^ 1 
on the right. In the case of BiCGStab this complication does not appear, as M and MT are preconditioned 
separately. Finally, in Table [3] we show the behavior of the weak-coupling preconditioner as a function of the 
coupling, for g — 2.5, 5.0 and 7.5. We note that the case of g = 10.0 (A ~ 1.14), typically failed to converge 
in less than 15,000 MV operations. 

6. Summary and Conclusions 

In this work, we have evaluated three different strategies to precondition the non-relativistic many- 
fermion problem: Chebyshev polynomials, a strong-coupling approximation and a weak-coupling expansion. 
We have argued that such preconditioning forms a central part of HMC calculations, as the frequent solution 
of a linear problem involving the ill-conditioned fermion matrix M is at the heart of the HMC algorithm. We 
have considered the cases of normal equations M^Mx = b which is tractable using the CG algorithm, as well 
as the two-step approach of solving M^y = b followed by Mx = y, using the BiCG or BiCGStab algorithms. 
Our results indicate that both the Chebyshev polynomials (Sec. [2]) and the weak-coupling expansion (Sec. [3]) 
can be effective preconditioners, especially when high orders are used. However, from the point of view of 
performance, Chebyshev polynomials represent an expensive choice. Additionally, the Chebyshev approach 
requires a matrix with a positive definite eigenvalue spectrum, which forces us to work with the CG solver 
on the normal equations. This is unfortunate, as M tends to be a rather ill-conditioned matrix (especially 
at low temperatures), which makes the normal system Qx — b extremely ill-conditioned. The result is that 
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Table 2: Summary of the number of matrix-vector (MV) operations required for the solution of M* Mx = b with CG, or 
M^y = b followed by Mx = y with BiCGStab. Notation and parameters are as for Tabled The rightmost column illustrates 
the gain provided by the lowest-order weak-coupling expansion using BiCGStab over the most favorable case using CG. As 
explained in the text, the actual gain in CPU time is likely to be somewhat less. Notice in particular the mild increase in the 
number of MV operations for BiCGStab as a function of JV^. and N T . 



Lattice size 




CG 








BiCGStab 






N* x N T 


Cheb8 


Chebl6 


ChebS2 


SCA 


WCE0 


WCE1 


WCE2 


Gain 


6 3 x 50 


5,292 


5,678 


6,864 


5,370 


768 


924 


990 


7x 


8 3 x 50 


6,948 


7,684 


8,580 


7,164 


882 


1,008 


1,116 


8x 


10 3 x 50 


8,802 


8,806 


10,758 


8,430 


912 


1,116 


1,206 


9x 


6 3 x 100 


9,612 


10,744 


12,342 


10,410 


972 


1,164 


1,278 


lOx 


8 3 x 100 


15,750 


15,640 


19,008 


15,669 


1,098 


1,368 


1,404 


14x 


10 3 x 100 


19,836 


20,740 


25,212 


21,313 


1,176 


1,416 


1,584 


17x 


6 3 x 200 


15,552 


18,020 


18,216 


17,961 


1,248 


1,428 


1,566 


12x 


8 3 x 200 


59,418 


33,932 


39,864 




1,476 


1,776 


1,890 


23x 


10 3 x 200 




44,404 


48,510 




1,674 


1,944 


2,178 


27x 



Table 3: Effectiveness of the second-order (WCE2) and fourth-order (WCE4) weak-coupling preconditioners as a function of 
g. For g = 2.5, we have A ~ 0.52, for g = 5.0, A ~ 0.75 and for g = 7.5, A ~ 0.95. The figures given represent the number of 
matrix- vector operations required to solve M^y = b followed by Mx = y with BiCGStab. These are averages over 10 random 
configurations, with an uncertainly in the range of 10-15%. 



Lattice size 
JV» x N T 


9 = 


2.5 


.9 = 5 




9 = 


7.5 


WCE2 


WCE4 


WCE2 


WCE4 


WCE2 


WCE4 


10 3 x 50 


630 


600 


1,206 


1,320 


2,358 


4,110 


10 3 x 100 


702 


690 


1,584 


1,530 


3,816 


9,330 


10 3 x 200 


864 


780 


2,178 


2,190 


6,444 


12,720 



the number of MV operations required to solve the problem with a preset accuracy grows rapidly with the 
size of the problem, in particular with N T , which controls the temperature. 

We have shown that solving M'y = b followed by Mx = y using the BiCGStab algorithm provides an 
alternative, extremely promising approach. The use of a weak-coupling preconditioner supplemented with a 
simple convergence acceleration technique solves the problem elegantly and provides dramatically enhanced 
performance in terms of both number of iterations and MV operations, which translates into significant 
savings of CPU time and improved scaling of the HMC algorithm at low temperature. While we have only 
provided a first, sketchy comparison of the two methods, the difference between using CG with Chebyshev 
preconditioning and BiCGStab with weak-coupling preconditioning is so striking that the latter technique is 
the obvious choice. The weak-coupling preconditioner represents a good example in which knowledge of the 
structure of the matrix, as well as the physical problem at hand, allows for the construction of an effective 
and efficient strategy to accelerate the solution process. 

The strong-coupling approximation of Ref. |18| provides an alternative to Chebyshev preconditioning. 
While computationally inexpensive, this preconditioner turned out to be less efficient in terms of its ability 
to reduce the number of CG iterations. Apparently, this approach fails to capture the relevant physics at 
the range of couplings in the vicinity of the unitary limit. In an effort to shed more light on this problem, we 
explored the behavior of the weak-coupling preconditioner as a function of the coupling g, and found that its 
effectiveness breaks down somewhere between g = 7.5 and 10. This situation might be remedied by using a 
strong-coupling expansion in powers of A , although our attempts to apply such an expansion have so far 
not been successful. However, we believe this to be a topic worthy of further investigation. Alternatively, one 
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can reduce the value of the imaginary time step r, which serves to extend the usefulness of the weak-coupling 
expansion to larger values of g, given that the relevant expansion parameter is A — \/2\/cxp(Tg) — 1. 
However, this has the drawback of increasing the value of N T required to achieve a given temperature, which 
can be taxing on the method. 

Various cases of interest that could and should be studied lie beyond the scope of this work. Among 
these are the case of finite effective range, relevant for nuclear and neutron matter calculations, which would 
make the potential energy operator more dense in real space (it is diagonal for the zero-range interaction 
considered here). Other dispersion relations, such as that of the conventional Hubbard model should be 
studied as well, particularly since this might open the door to large-scale HMC simulations in the fields of 
solid-state and atomic physics. Finally, we have focused here on a problem in 3 + 1 dimensions, although 
the methods employed in this study carry over directly to applications in lower dimensions. 
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